fix(connection-manager): add per-address timeout to prevent slow addr…#3412
Conversation
|
Thanks for this @paschal533 Kind of thinking parallel dial with limits (2 or 3 concurrent)... Maybe dial best address first via AddressSorter then after 250ms race the rest? |
If you mean dial them in parallel I would be careful here as this will consume a lot of resources, particularly if the node is doing, for example, DHT things where you expect most dials to fail. This opens the node up to an amplification-style attack where a peer might have a peer record with lots of bogus or slow addresses so a single dial request causes the target node make many dials, all of which are doomed. The current behaviour of dialling addresses sequentially was introduced partly because of the above, but also due to empirical observations that if the first address dialled is a "good" candidate (e.g. public, non-NATed, or supports hole punching) and it fails, all other addresses will likely fail too. That aside, having a shorter per-address timeout signal in addition to an overall dial timeout seems like a reasonable approach. |
|
Thanks for the context @achingbrain |
| /** | ||
| * @see https://libp2p.github.io/js-libp2p/interfaces/index._internal_.ConnectionManagerConfig.html#addressDialTimeout | ||
| */ | ||
| export const ADDRESS_DIAL_TIMEOUT = 6_000 |
There was a problem hiding this comment.
This default looks reasonable.
In the future wondering if it may be beneficial to add a smaller timeout at the raw connection layer. The intention would be to quickly timeout (~4 seconds) when the destination host is offline.
ed86ac1 to
cc1c321
Compare
- Fix JSDoc @see URL to point to public ConnectionManagerInit interface instead of internal ConnectionManagerConfig - Replace timing-based assertion in the sequential-hang test with a signal-based one: listen for abort on each hung dial's options.signal and count aborts. Verifies the per-address AbortSignal was actually fired three times rather than measuring wall-clock elapsed time, which is flaky on loaded CI runners.
|
hey @tabcat, made both changes... fixed the JSDoc URL to point to |
Problem
When dialing a peer with multiple multiaddrs (e.g.
[/ip4/10.2.0.2/tcp/9106, /ip4/127.0.0.1/tcp/9106]), thedialTimeout(default 10s) was shared across all address attempts sequentially. If the first addresshung, for example, a private IP in a different subnet whose TCP SYN is silently dropped, it consumed the entire timeout budget and subsequent reachable addresses were never tried.
Root cause
In
dial-queue.ts, the batch-level signal (AbortSignal.timeout(dialTimeout)) was passed directly to every individualtransportManager.dial()call. One slow address = all addresses fail.Fix
Add
addressDialTimeout(default6_000ms) toConnectionManagerConfig. For each address attempt, a per-address signal is composed from the outer batch signal and a freshAbortSignal.timeout(addressDialTimeout):The addressDialTimeout is configurable via connectionManager.addressDialTimeout for users on slower transports (WebRTC, satellite links) who need more time per address.
Tests
Added a describe('addressDialTimeout') suite with 5 tests:
Closes #2368